AFRL-IF-RS-TR-2006-79 
Final  Technical  Report 
March  2006 


PHYSICO-CHEMICAL  PROKARYOTE  MODELS: 
STAND-ALONE  MODULES  AND  KARYOTE 
INTEGRATION 

Indiana  University 


Sponsored  by 

Defense  Advanced  Research  Projects  Agency 


DARPA  Order  No.  P127 


APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  UNLIMITED. 


The  views  and  conclusions  contained  in  this  document  are  those  of  the  authors  and  should  not  be 
interpreted  as  necessarily  representing  the  official  policies,  either  expressed  or  implied,  of  the 
Defense  Advanced  Research  Projects  Agency  or  the  U.S.  Government. 


AIR  FORCE  RESEARCH  LABORATORY 
INFORMATION  DIRECTORATE 
ROME  RESEARCH  SITE 
ROME,  NEW  YORK 


STINFO  FINAL  REPORT 


This  report  has  been  reviewed  by  the  Air  Force  Research  Laboratory,  Information 
Directorate,  Public  Affairs  Office  (IFOIPA)  and  is  releasable  to  the  National  Technical 
Information  Service  (NTIS).  At  NTIS  it  will  be  releasable  to  the  general  public, 
including  foreign  nations. 

AFRL-IF-RS-TR-2006-79  has  been  reviewed  and  is  approved  for  publication. 


APPROVED:  /s/ 


ROBERT  L.  KAMINSKI 
Project  Engineer 


FOR  THE  DIRECTOR:  /s/ 


WARREN  H.  DEBANY,  Technical  Advisor 
Information  Grid  Division 
Information  Directorate 


DESTRUCTION  NOTICE  -  For  classified  documents,  follow  the  procedures  in  DOD 
5200.22M.  Industrial  Security  Manual  or  DOD  5200. 1-R,  Information  Security  Program 
Regulation.  For  unclassified  limited  documents,  destroy  by  any  method  that  will  prevent 
disclosure  of  contents  or  reconstruction  of  the  document. 


REPORT  DOCUMENTATION  PAGE 


Form  Approved 
OMB  No.  074-0188 


Public  reporting  burden  for  this  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  this  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information,  including 
suggestions  for  reducing  this  burden  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington,  VA  22202-4302, 
and  to  the  Office  of  Management  and  Budget,  Paperwork  Reduction  Project  (0704-0188),  Washington,  DC  20503 _ 


1.  AGENCY  USE  ONLY  (Leave  blank)  2.  REPORT  DATE  3.  REPORT  TYPE  AND  DATES  COVERED 

MARCH  2006  Final  Sep  02  -  Sep  05 


4.  TITLE  AND  SUBTITLE  5.  FUNDING  NUMBERS 

PHYSICO-CHEMICAL  PROKARYOTE  MODELS:  C  -  F30602-02-2-0001 


STAND-ALONE  MODULES  AND  KARYOTE  INTEGRATION 


6.  AUTHOR(S) 
P.  Ortoleva 


PE  - 
PR  - 
TA  - 
WU  - 


61 1 01 E 
BIOC 
PI 
27 


7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Indiana  University 
P.  O.  Box  1847 

Bloomington  Indiana  47402-1847 


8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


9.  SPONSORING  /  MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

Defense  Advanced  Research  Projects  Agency  AFRL/IFGA 
3701  North  Fairfax  Drive  525  Brooks  Road 

Arlington  Virginia  22203-1714  Rome  New  York  13441-4514 


10.  SPONSORING  /  MONITORING 
AGENCY  REPORT  NUMBER 

AFRL-IF-RS-TR-2006-79 


11.  SUPPLEMENTARY  NOTES 


AFRL  Project  Engineer:  Robert  L.  Kaminski/IFGA/(315)  330-1867/  Robert.Kaminski@rl.af.mil 


12a.  DISTRIBUTION  /  AVAILABILITY  STATEMENT 

APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  UNLIMITED. 


12b.  DISTRIBUTION  CODE 


13.  ABSTRACT  (Maximum  200  Words) 

At  the  Center  for  Cell  and  Virus  Theory  (CCVT)  several  types  of  systems  biology  modules  were  developed  for  the  Bio- 
SPICE  project.  The  cell  modeling  modules  account  for  metabolic,  proteomic  and  genomic  kinetics  and  their  spatially 
localized,  multiple  scale  character.  Modules  for  model  development,  calibration  and  multiplex  (e.g.  genome-wide 
expression)  data  interpretation  are  also  provided.  Models  are  made  available  via  three  complementary  mechanisms:  1) 
the  Bio-SPICE  system;  2)  open  source  stand-alone  code;  and  3)  a  website  (sysbio.indiana.edu)  run  by  CCVT.  The  latter 
can  also  be  run  through  the  Bio-SPICE  Dashboard.  CCVT  software  has  been  demonstrated  in  a  variety  of  contexts 
including  transcriptional  regulatory  networks  in  prokaryotes  and  B  cells,  self-organized  division  mechanisms  in  E.  coli, 
glycolysis  in  yeast  and  T.  brucei,  and  the  transcriptional  response  of  human  cells  subjected  to  toxins.  As  CCVT  also  has 
an  educational  function,  a  number  of  graduate  and  undergraduate  students  have  been  trained  as  the  next  generation  of 
systems  biologists. 


14.  SUBJECT  TERMS 


Bio-SPICE,  gene  expression,  cell  modeling,  data/model  integration,  kinetics,  diffusion, 
transcription  factor,  transcriptional  regulatory  network 


17.  SECURITY  CLASSIFICATION 
OF  REPORT 


18.  SECURITY  CLASSIFICATION 
OF  THIS  PAGE 


19.  SECURITY  CLASSIFICATION 
OF  ABSTRACT 


15.  NUMBER  OF  PAGES 

35 


16.  PRICE  CODE 


20.  LIMITATION  OF  ABSTRACT 


UNCLASSIFIED 


NSN  7540-01-280-5500 


UNCLASSIFIED 


UNCLASSIFIED 


Standard  Form  298  (Rev.  2-89) 

Prescribed  by  ANSI  Std.  Z39-18 
298-102 


Table  of  Contents 


1.0  Summary . 1 

2.0  Introduction . 3 

3.0  Karyote  Cell  Analyzer . 4 

4.0  Transcriptional  Regulatory  Network  Construction . 9 

5.0  KAGAN :  Karyote  Gene  Analyzer . 18 

6.0  GeneDat:  Human  and  Bacterial  Transcriptional  Regulatory  Database . 23 

7.0  CellX:  Multi-Dimensional  Cell  Model . 24 

8.0  Conclusions . 27 

9.0  Recommendations . 28 

10.0  References . 29 


i 


List  of  Figures 


Figure  1  -  Cell  divided  into  compartments . 5 

Figure  2:  Probability  distribution  for  correlation  (Pearson)  between  a  random  pair  and 

known  F/generegulatory  interaction  for  E.coli . 1 1 

Figure  3:  Properties  of  TRNs  used  in  the  synthetic  examples . 13 

Figure  4:  Effect  of  TRN  properties . 13 

Figure  5:  Reconstruction  of  TRNs . 14 

Figure  6:  Probability  distribution  as  a  function  of  a)  GO,  b)  phylogenic  similarity,  c)  FTF 

scores . 16 

Figure  7:  Probability  distributions  for  the  final  score . 17 

Figure  8:  Predicted  TF  activity  time  courses  for  16  of  38  TFs  constructed  using  our  module  of 

C3  and  a  preliminary  TRN  (from  www.ecocyc.com  and  gene  expression  data) . 22 

Figure  9a:  Surface  pole-to-pole  MinD  concentration  profile  for  a  nonnal  length, . 25 

Figure  9a:  Surface  pole-to-pole  MinD  concentration  profile  for  a  normal  length,  rod  shaped 

E.coli  cell,  suggesting  one  division  plane  at  the  middle . 25 

Figure  9b  Same  as  (a)  except  for  a  1.5  x  normal  length  cell . 26 

Figure  9c  Same  as  (a)  except  for  a  2  x  normal  length  cell . 26 

List  of  Tables 

Table  1.  Units  for  input  or  output  variables  in  KCA . 5 

ii 


1.0  Summary 

Systems  biology  modules  were  developed  for  Bio-SPICE.  The  modules  were  of  two  types  -  cell 
models  and  data/model  integrators.  Selected  modules  were  made  available  through  Bio-SPICE  as 
components  of  the  Dashboard  while  others  were  designed  to  be  stand-alone  or  available  as  a  web 
service. 

Cell  models  are  Karyote  and  CellX.  The  former  is  an  ordinary  differential  equation  model  of 
a  compartmented  cell  which  can  simulate  both  eukaryotic  and  prokaryotic  cells.  The  reactions 
allowed  in  each  compartment  and  the  membrane  transport  processes  accounted  for  are  general  in 
character.  A  companion  website  ( sysbio.indiana.edu )  was  set  up  that  allows  users  to  create  fdes 
for  Karyote  input.  These  include  single  and  multiple  cell  (suspension  and  tissue)  models. 

Karyote  was  demonstrated  using  yeast  and  the  parasite  T.  brucei,  the  causative  agent  in  sleeping 
sickness. 

CellX  is  a  partial  differential  equation-based  cell  model.  In  its  Bio-SPICE  implementation  it 
is  designed  for  prokaryotic  cell  simulation.  Reaction-transport  equations  are  solved  in  the  cell 
interior  (3-D),  on  or  within  the  cell  membrane  (2-D),  and  a  boundary  continuity  equation 
accounts  for  processes  of  exchange  between  the  interior  and  the  membrane.  CellX  was 
demonstrated  for  the  self-organized  plane  of  division  in  E.  coli  via  Min  protein  reaction-transport 
processes. 

Data/model  integration  modules  were  also  developed.  These  modules  allow  a  user  to  directly 
extract  cell  modeling  information  from  multiplex  data  (e.g.  cDNA  microarray,  NMR  and 
proteomics).  The  microarray-based  modules  were  made  part  of  Bio-SPICE.  One  of  these 
modules,  FTF,  is  designed  to  extract  transcriptional  regulatory  information  from  cDNA 
microarray  data  in  time  series  or  steady  states  for  cells  in  various  extracellular  conditions.  The 
KAGAN  module  uses  cDNA  microarray  data  to  refine  a  transcriptional  regulatory  network  and 
calibrate  associated  rate  and  binding  constants.  Our  transcriptional  regulatory  network 
construction  modules  (FTF  and  KAGAN)  are  built  on  the  estimation  of  transcription  factor 
profiles.  Most  other  methods  of  microarray  data  analysis  are  based  on  the  assumption  that 
protein  profiles  are  in  step  with  profiles  of  the  encoding  RNA  -  an  assumption  that  has  been 
shown  experimentally  to  be  untrue  in  a  number  of  cases  where  detailed  proteomics  and  RNA 
expression  data  were  both  available. 
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FTF  is  highly  CPU  efficient  so  that  many  networks  can  be  tested  to  arrive  at  one  which  is 
most  consistent  with  the  available  microarray  data.  We  demonstrated  that  FTF  is  ideally  nested 
in  an  overall  workflow  aimed  at  transcriptional  regulatory  network  construction.  In  this  way 
FTF,  combined  with  promoter  analysis,  gene  ontology  and  phylogenic  similarity  can  be  used  to 
greatly  increase  the  number  of  transcriptional  factor/gene  regulatory  interactions  discovered. 

FTF  with  gene  ontology  was  used  with  a  dataset  of  336  microarrays  on  B  cells  to  create  a  very 
large  network  for  these  cells  (posted  at  sysbio.indiana.edu). 

KAGAN,  the  second  microarray-based  transcriptional  regulatory  network  construction 
module  developed  and  installed  at  Bio-SPICE,  can  refine  an  input  network  and  calibrate  the 
transcription  and  RNA  degradation  rate  constants,  as  well  as  transcription  factor/gene  binding 
constants. 

As  FTF  is  optimized  for  network  construction,  and  KAGAN  is  designed  to  take  a  network 
and  refine  and  calibrate  the  associated  biochemical  kinetic  parameters,  they  are  ideally  suited  for 
a  two  step  network  inference  workflow.  This  has  been  implemented  at  sysbio.indiana.edu. 

For  FTF,  KAGAN,  and  our  bioinformatics  modules,  it  is  necessary  to  have  a  preliminary 
network/training  set.  To  serve  users  we  have  created  the  GeneDat  database  containing  over 
13,000  transcription  factor/gene  regulatory  interactions  for  mammalian  (mostly  human)  cells.  It 
also  contains  what  we  believe  to  be  state-of-the-art  network  information  for  E.  coli  and  B. 
subtilis.  With  each  transcription  factor/gene  regulatory  interaction  a  variety  of  annotations  are 
provided. 
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2.0  Introduction 


The  modeling  systems  developed  in  this  Bio-SPICE  project  have  distinct  levels  of  physics  and 
chemistry  according  to  the  phenomena  they  address.  In  this  report  the  Bio-SPICE  modules 
developed  at  CCVT  are  described  in  some  detail.  There  modules  are  as  follows. 

The  Karyote  system  solves  ordinary  differential  equations:  a  cell  is  divided  into 
compartments  within  each  of  which  user-specified  reactions  take  place  and  between  which 
molecules  are  exchanged  by  active  and  passive  processes.  A  detailed  tutorial  with  15  instructive 
mechanistic  models  is  provided  as  are  datasets  for  yeast  glycolysis  and  T.  brucei  (the  causative 
agent  of  sleeping  sickness)  oxidative  and  anoxic  glycolysis.  Karyote  has  special  features  such  as 
the  ability  to  construct  models  involving  suspensions  of  various  types  of  cells  (e.g.  blood)  and 
the  unification  of  two  cell  models  into  one,  more  comprehensive  one.  Karyote  is  SBML 
compatible  (i.e.  SBML  files  can  be  created  for  use  by  other  simulators  or  received  from  others  as 
input  files). 

FTF  and  KAGAN  are  unique  microarray  data  analysis  modules  for  the  construction  of  gene 
regulatory  networks.  FTF  focuses  on  network  structure  (i.e.  it  delineates  the  transcription  factors 
regulating  each  gene).  KAGAN  focuses  on  calibrating  rate  coefficients  for  transcription  and 
RNA  degradation,  transcription  factor/gene  binding  constants,  and  network  structure  refinement. 

GeneDat  is  a  database  of  over  13,000  experimentally-verified  up/down  transcription 
factor/gene  regulatory  interactions  (mostly  human)  annotated  with  species,  cell  line,  and  data 
source.  Software  associated  with  GeneDat  launches  queries  that  automatically  provide  a 
preliminary  regulatory  network  for  FTF,  KAGAN,  or  other  network  analysis/improvement 
modules. 

CellX  has  most  of  the  features  of  Karyote.  In  addition,  3-D  concentration  distributions  within 
the  cell  and  2-D  distributions  along  the  cell  inner  surface  are  computed  using  finite  element 
methods. 
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3.0  Karyote  Cell  Analyzer 

Overview 

The  Karyote  Cell  Analyzer  (KCA)  is  based  on  the  principles  of  chemistry  and  physics  as 
formulated  for  single  and  multiple  cell  systems.  It  is  designed  for  fundamental  studies  into  the 
workings  of  a  cell  and  for  applications  of  this  understanding  in  drug  and  vaccine  design, 
treatment  optimization,  refined  diagnoses  of  complex  diseases  like  cancer,  environmental 
analysis,  and  biotechnical  process  engineering.  The  input  to  KCA  is  the  network  of  biochemical 
processes  and  rate/equilibrium  data,  intracellular  architecture,  membrane  transport  properties, 
initial  cell  state  and  conditions  in  the  extracellular  medium.  The  output  of  KCA  is  the 
concentration  timecourse  of  all  chemical  species  in  all  compartments  (and  all  cells  for  a 
multicellular  simulation). 

Karyote  Cell  Analyzer  Functionality 

KCA  simulates  compartmentalized  cellular  reaction- transport  processes.  Ordinary  differential 
equations  are  solved  in  each  compartment  and  active  and  passive  molecular  transfer  between 
compartments  is  accounted  for.  All  processes  are  fully  coupled  through  the  dependence  of  rate 
laws  on  composition.  Solution  techniques  include  multiple  timescale  analysis  and  a  stiff  solver 
package  with  method  switching  for  efficiency.  Reactions  designated  as  fast  are  maintained  close 
to  equilibrium  or  steady  state  as  coupled  processes  may  indicate.  Mass  conservation  errors  made 
by  assuming  rate  laws  in  the  form  of  polynomial  ratios  are  avoided.  Both  rate  and  equilibrium 
constants  must  be  supplied  for  finite  rate  processes  while  only  equilibrium  constants  are  to  be 
provided  for  fast,  equilibrated  ones;  when  subsets  of  fast  reactions  are  in  steady  state  balance 
(with  a  nonzero  net  overall  rate)  both  forward  and  reverse  rate  constants  must  be  provided. 

Many  physico-chemical  processes  underlying  cell  behavior  are  accounted  for  in  KCA.  In 
these  notes  we  familiarize  the  user  with  them  by  illustrating  how  to  run  KCA  via  examples  as 
described  in  Explaining  KCA  Though  Simple  Cell  Models,  below.  In  KCA  one  may  consider  a 
cell  to  be  a  single  compartment  reactor  or  may  divide  it  into  compartments  within  each  of  which 
specialized  reactions  take  place.  One  may  build  very  extensive  reaction  networks,  including 
equilibrated  reactions  or  cycles  of  reactions  in  steady-state  balance  that  have  great  complexity. 
One  may  vary  parameters  in  the  membrane  flux  laws  or  create  transcompartmental  reactions  that 
simultaneously  involve  chemical  species  on  both  sides  of  a  membrane  (e.g.  catalysis  or  ion 
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pumps).  As  KCA  computations  are  hierarchical  one  may  construct  systems  with  compartments 
within  compartments  and  thereby  model  multi-cellular  systems  (e.g.  cell  suspensions,  tissues  or 
embryos). 


Quantity 

Units 

Time 

seconds 

Volume 

liters 

Area 

(decimeters)2 

Concentration 

millimole/liter 

equilibrium  constant  for  m-th  order  forward.  «-th  order  reverse  action 

(millimole/liter)n"m"1sec'1 

rate  coefficient  for  n-th  order  reverse  action 

(liter/millimole)n- 1 

K  membrane  permeability  factor 

millimoles/liter 

VM  maximum  permeability 

decimeter  (sec)-l 

(f)  membrane  asymmetry  parameter 

none 

Table  1.  Units  for  input  or  output  variables  in  KCA. 


The  nature  of  a  KCA  cell  model  is  suggested  in  Fig.  1.  The  system  is  divided  into 
compartments  (i.e.  cytosol  and  organelles).  These  subsystems  are  separated  by  membranes 
across  which  molecules  can  be  exchanged.  Molecules  can  only  exchange  between  compartments 
that  the  user  specifies  to  have  a  common  membrane  of  given  surface  area;  thus  surface  areas 
Aaa  between  compartments  a  and  a' are  used  to  define  system  configuration,  and  similarly  for 
the  volume  V°  of  compartment  a .  This  allows  the  user  to  define  complex  intra-cellular 
architectures  or  multi-cellular  systems  (i.e.  consisting  of  compartments  within  compartments  in  a 
hierarchical  fashion).  Many  of  the  general  chemical  kinetics  concepts  used  in  KCA  are  reviewed 
in  the  following  sections  and  in  a  mini-course  on  Chemical  Kinetics  available  at 
sysbio.  indiana.  edu . 


Figure  1  -  Cell  divided  into  compartments 
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Figure  1:  InKaryote  the  cell  is  divided  into  compartments  labeled  oc  =  l,2,---  separated  by 
membranes.  For  each  compartment  ordinary  differential  reaction-transport  equations  are  used  to 
simulate  the  evolution  of  descriptive  variables  (e.g.  concentrations  and  electrical  potentials). 

Mathematical  Formulation 

Publications  that  contain  more  technical  details  about  the  KCA  include  Ortoleva  et  al.  (2003); 
Weitzke  and  Ortoleva  (2003);  Sayyed-Ahmad  et  al.  (2003);  and  Navid  and  Ortoleva  (2004). 
These  and  other  papers  are  available  through  our  website  ( sysbio.indiana.edu ). 

In  KCA  the  cell  is  divided  into  Nc  compartments  labeled  a  =  1,2,- ■  ■  Nc .  In  compartment  a  , 

each  molecular  species  i  =  1,2,- -N  is  described  by  its  concentration  c“  .  Conservation  of  mass 
implies 


Aaa'  =  boundary  surface  area  separating  compartments  a  and  a' 

J"a'  =  net  flux  of  species  i  from  a'  to  a[=  -J''u  j 

N,  Nc,Nr  =  number  of  chemical  species,  compartments  and  reactions,  respectively 

Va  =  volume  of  compartment  a 

W“  =  rate  of  reaction  k  in  compartment  a 

v“  =  stoichiometric  coefficient  for  species  i  in  reaction  k  in  compartment  a  . 

In  KCA  the  last  tenn  is  divided  into  fast  and  slow  contributions  to  take  advantage  of  multiple 
scale  methods.  Let  s  be  a  small  parameter.  Then  the  net  reaction  rate  (the  last  tenn  in  (1) 

NS  J  Nf 

divided  by  Va ),  denoted  R" ,  is  written  R"  =  £  v,;Wlas  +  j£tffw,af . 

1=1  1=1 

As  s  — »  0  the  system  is  driven  close  to  equilibrium  (so  that  W,aJ  =0)  or  linear  combinations  of 
the  vanish  (to  express  steady-state  cycles).  Furthermore  minority  species  (e.g.  enzymes) 
impart  another  element  of  stiffness  to  the  cell  simulation  problem.  Thus  in  KCA  species  are 
divided  into  majority  and  minority  categories.  The  latter  have  concentrations  that  scale  with  s  in 
our  formalism.  The  resulting  multi-scale  analysis  avoids  difficulties  in  numerical  simulations, 
especially  when  minority  species  participate  in  fast  reactions.  This  minority/majority  separation 
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allows  for  greater  computational  efficiencies.  As  the  use  of  this  option  places  more  burden  on  the 
user  (i.e.  discriminating  minority  versus  majority  species),  we  do  not  include  it  in  this  Bio- 
SPICE  release. 

Rates  of  reaction  are  of  the  mass-action  form: 

Rate  =  k  Q\\  cv.‘  ]~ [  c  '  (2) 

Vj>0  Vj<0 

where  k  is  the  reverse  rate  coefficient,  Q  is  the  equilibrium  constant  (i.e.  k(Q  is  the  forward  rate 
coefficient)  and  v.  is  the  stoichiometric  coefficient  (i.e.  v  >  0  for  products  and  <  0  for  reactants. 
The  passive  flux  law  is  in  the  fonn 


J"mme  =h[c'-c\ 

(3) 

K2  V 

Jj  —  «  « 

[K+K,Xc  +  c')+^c'] 

(4) 

for  a  given  membrane  and  species;  c  and  c'  are  concentrations  on  either  side  of  the  membrane 
(see  Table  1  for  other  variables). 

Intramembrane  enzymatic  and  active  transport  processes  are  accounted  for  as 
transcompartmental  reactions.  For  example, 

2 A( cytosol )  +  B ( mitochrondrion )f!  C ( cytosol )  +  3D ( mitochrondrion ) 

Again  the  rate  law  is  assumed  to  be  of  the  mass  action  form.  The  dependence  of  the  rate  of  such 
processes  on  the  area  of  the  membrane  is  accounted  for,  i.e.  the  transcompartmental  reactions 
and  passive  flux  law  contributions  are  added  in  computing  Jaa' . 

The  user  can  choose  between  explicit  integration,  Runge-Kutta  integration  (both  using 
multiple  timescale  separation  techniques  discussed  in  Ortoleva  (1992)  and  Weitzke  and  Ortoleva 
(2003)),  and  implicit  integration.  We  use  the  double  precision  VODE 
( www.llnl.gov/CASC/download/download_home.html )  as  the  implicit  integrator. 

The  Web-based  Karyote  Cell  Modeling  System 

The  Karyote  Cell  Modeling  System  (KCMS)  performs  a  multiplicity  of  functions  in  support  of 
DARPA  Bio-SPICE  activities  by  familiarizing  beginning  and  advanced  researchers  with 
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concepts  in  cell  modeling.  As  new  computational  biology  modules  are  developed  and  tested  at 
CCVT  they  are  in  the  web-based  system  at  sysbio.indiana.edu. 

Essential  features  of  the  web-based  KCMS  system  include: 

•  Cell  Assembler  (to  build  a  cell  model  and  create  an  SBML  KCA  input  file); 

•  Cell  Unification  (to  integrate  existing  cell  models,  pathways  or  subsystems  into  a  more 
comprehensive  model  and  create  an  SBML  file); 

•  Multi-Cellular  System  (to  build  a  suspension  or  other  multi-cellular  configuration  from 
single  cell  models  and  create  an  SBML  file); 

•  Information  Theory  (to  calibrate  a  model  using  NMR,  spectral,  microarray,  electrical 
potentiometry  and  other  datasets  individually  or  simultaneously  in  static  or  time  series  - 
only  micro  array  data  analysis  is  installed  at  this  writing);  and 

•  Run  the  Cell  Model  (to  derive  information  about  the  response  of  a  cell  to  changes  in 
extra-cellular  conditions  and  other  stimuli  or  interaction  with  other  cells). These  functions 
are  continuously  being  improved. 
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4.0  Transcriptional  Regulatory  Network  Construction 


Overview 

Two  modules  for  constructing  transcriptional  regulatory  networks  using  cDNA  microarray  data 
were  developed  for  Bio-SPICE.  In  this  section  they  are  described  and  it  is  shown  how  they  can 
be  used  in  a  wider  strategy  that  introduces  several  bioinfonnatics  modules  in  order  to  close  the 
gap  between  the  great  complexity  of  a  transcriptional  regulatory  network  and  the  infonnation 
needed  to  do  so. 

A  number  of  techniques  have  been  proposed  to  infer  transcriptional  regulatory  networks 
(TRNs)  using  cDNA  microarray-monitored  expression  profiles.  Among  them  are  principal 
component  analysis  (Holter  et  al.  2000,  2001)  and  independent  component  analysis 
(Liebermeister  2002).  Network  component  analysis  (NCA)  differs  from  other  techniques  in  that 
the  structure  of  the  gene  regulatory  network  is  assumed  to  be  known  (Liao  et  al.  2003). 

Therefore,  NCA’s  use  is  limited  to  cases  in  which  the  network  is  fairly  well  known  and  has 
strong  structural  limitations.  In  reality,  only  an  incomplete  and  possibly  biased  TRN  is  available 
for  any  cell  due  to  the  experimental  conditions  imposed.  Gardner  et  al.  (2003)  proposed  a 
methodology  to  construct  the  gene-gene  control  network  structure  of  small  networks  using 
microarray  data,  limiting  the  number  of  interactions  per  gene.  We  tested  a  similar  approach  for 
large  networks  and  showed  that  even  when  there  are  just  a  few  interactions  per  gene,  there  can  be 
thousands  of  networks  that  are  consistent  with  a  given  microarray  dataset  to  within  essentially 
the  same  accuracy.  Kyoda  et  al.  (2000)  developed  a  methodology  that  employs  mutation 
experiments  to  arrive  at  the  TRN.  However,  it  is  questionable  whether  their  approach  can  be 
applied  to  large  TRNs.  Liang  et  al.  (1998)  presented  a  methodology  for  Boolean  networks  and 
applied  it  to  a  small  50  gene  system  with  at  most  3  interactions  per  gene.  Boolean  networks  are 
an  oversimplification  of  gene  expression  as  they  use  a  binary  approximation  fully  on  or  off 
(Huang  1999).  Cluster  analysis  is  based  on  statistical  techniques  wherein  correlations  are  sought 
between  the  responses  of  genes  (e.g.  Azuaje  2002;  Bolshakova  and  Azuaje  2003).  However  the 
coordination  can  be  extremely  complex  and  circuitous,  i.e.  genes  may  be  involved  in  a  multi¬ 
branch  feedback  loop  involving  several  transcription  factors  (TLs)  made,  or  activated/deactivated 
by  their  resulting  translated  proteins.  These  time-delayed,  complex  relationships  are  revealed  by 
our  method  as  it  discovers  and  quantifies  many  of  these  feedback  relationships.  Although  cluster 
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analysis  might  suggest  groups  of  genes  that  have  related  functionalities,  it  is  not  an  accurate 
methodology  for  suggesting  TF/gene  regulatory  interactions.  D’Haeseleer  et  al.  (2000)  applied 
clustering  based  on  correlation  of  microarray  data.  To  assess  the  feasibility  of  inferring  networks 
using  expression  data  only,  we  used  two  independent  gene  expression  datasets  and  a  TRN  for 
E.coli  ( http://ecocyc.com ).  Fig.  2  shows  the  probability  of  correlation  between  two  random  genes 
and  that  for  known  TF/gene  interactions.  The  similarity  of  these  distributions  demonstrates  that  a 
successful  reconstruction  of  the  network  using  expression  data  alone  does  not  seem  likely. 
Mutual  information  (Basso  et  al.  2005)  seems  to  have  similar  limitations. 

If  TRN  construction  from  microarray  data  is  unfeasible  because  of  the  insufficient 
information  in  this  data,  then  the  solution  is  to  use  as  much  additional  information  as  possible  to 
rule  out  spurious  networks.  Segal  et  al.  (2003)  assumed  that  genes  in  the  same  pathway  are 
similarly  regulated  and  their  protein  products  often  interact.  This  led  them  to  the  use  of  protein- 
protein  interaction  information  in  their  predictions.  Brazma  et  al.  (1998)  studied  the  similarities 
of  the  upstream  regions  of  genes  that  have  a  similar  expression  profile.  A  similar  study  was 
presented  by  Haverty  et  al.  (2004)  who  used  statistical  methods  for  identifying  overabundant  TF 
binding  motifs  (from  TRANSFAC  and  JASPER)  and  microarray  data  to  infer  the  TRN.  The 
methodology  we  have  developed  is  the  only  one  that  computes  TF  activity  profiles,  correlates 
them  with  microarray  monitored  RNA  profiles,  and  integrates  the  results  with  promoter,  gene 
ontology,  and  phylogenic  analyses  as  follows. 

Network  inference  using  a  similarity  measure  assumes  that  the  activity  of  a  TF  is  represented 
by  the  expression  of  its  encoding  gene.  Failure  to  observe  such  a  high  correlation  for  E.  coli  (Fig. 
2)  shows  that  this  assumption  does  not  hold.  Therefore,  in  order  to  use  expression  data  to 
construct  a  TRN,  we  estimate  TF  activities  independent  of  the  expression  profile  of  the  encoding 
gene.  This  is  a  major  shift  in  strategy.  Our  approach  not  only  suggests  highly  probable  TF/gene 
interactions,  but  also  TF  activities  which  can  be  used  to  establish  the  sense  (up  versus  down)  of 
the  regulation  and  to  explore  post-translational  reactions  that  create  or  modify  TFs. 
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Figure  2:  Probability  distribution  for  correlation  (Pearson)  between  a  random  pair  and  known 

F/generegulatory  interaction  for  E.coli. 


Square  markers  refer  to  the  dataset  obtained  from  the  U.  of  Oklahoma  E.coli  database  (89  datasets; 
http://chase.ou.edu/macro/).  Diamond  markers  refer  to  the  datasets  obtained  from  the  NIH  omnibus  service  (GSE7, 
GSE8,  GSE9;  65  datasets).  The  solid  and  hollow  markers  show  the  probability  distribution  for  correlation  between  a 
random  gene  pair  and  known  gene/TF  regulatory  interaction,  respectively.  As  these  probability  distributions  are 
indistinguishable,  it  does  not  seem  feasible  to  construct  the  TRN  using  expression  data  alone.  We  also  calculated 
probability  distributions  for  mutual  information  which  yielded  similar  findings. 


FTF :  A  Statistical  Approach  to  Estimate  TF  Activity  Profiles 

In  designing  our  microarray-based  TRN  discovery  approach,  we  addressed  the  following 
challenges; 

•  omnipresent  noise/uncertainty  in  the  data; 

•  vastness  of  the  TRN; 

•  many  regulatory  mechanisms  (e.g.  from  TF/gene  binding  to  phosphorylation  and  histone 
interactions);  and 

•  sparseness  of  the  data  (relative  to  the  vastness  of  a  TRN)  imposed  by  the  cost  of 
microarray  data  acquisition. 

Thus,  we  have  developed  FTF  (Fast  Transcription  Factor)  for  network  construction  via  TF 
activity  estimation,  statistical  arguments,  and  a  preliminary  TRN.  FTF  is  based  on  the  following 
notions; 
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•  gene  expression  data  is  usually  error-prone  and  thus  some  consensus  method  is  needed 
whereby  results  from  a  variety  of  genes  are  synthesized  to  derive  regulatory  information 
on  a  given  gene; 

•  a  method  based  on  TFs  has  the  advantage  that  microarray  noise,  and  error  in  a  user- 
supplied  TRN,  can  be  overcome  by  statistics  —  i.e.  the  regulation  of  many  genes  by  a 
given  TF; 

•  due  to  data  uncertainty  sparsity,  there  is  not  usually  sufficient  information  to 
simultaneously  obtain  the  structure  of  the  TRN  and  the  associated  transcription  and  RNA 
degradation  rate  coefficients  (as  needed  in  steady-state  or  time-dependent  chemical 
kinetic  methods); 

•  network  discovery  requires  many  automated  trials  of  possible  networks  to  identify  those 
that  are  most  consistent  with  the  data,  so  the  algorithm  must  be  extremely  efficient;  and 

•  thus  the  objective  of  FTF  is  to  discover  the  structure  of  the  TRN  by  taking  advantage  of 
the  statistical  robustness  allowed  by  a  TF-based  statistical  analysis. 

The  essential  equation  on  which  FTF  is  based  was  arrived  at  empirically  after  extensive 
numerical  experimentation  with  synthetic  data  for  which  we  know  the  TRN,  TF  activities,  and 
the  statistics  of  noise  added  to  the  expression  data: 


i=l 


where  7j(  =  activity  of  TF  n  at  condition  or  time  r ,  m'  =  microarray  response  for  gene  i  at 
condition  r ,  bin  =TRN  ( bm  =  +1/-1  for  gene  i  up/down  regulated  by  TF  n  ,  bin  =  0  for  no 
regulation), H(x)  =  ±1  for  x  >  or  <  0  ,  =  0  for  x  =  0,  and  *F.n  =  2Li  /( M t2Li~ !)  for  =number  of 
TFs  controlling  gene  i ,  and  Mn  =  number  of  genes  TF  n  regulates.  If  there  are  NcDNA  time 
points  or  conditions,  then  one  can  write  NcDNA  x  (NcDNA  +1)1 2  equations  for  the  NcDNA  activities 
7jj  r  =  1,2,-- •  N  DNA  ,  for  each  of  the  NTF  TFs.  Therefore  TF  activities  are  obtained  from  the 
solution  of  (5)  via  a  least  squares  fit. 

Our  synthetic  examples  with  large  TRNs  show  that,  despite  the  simplicity  of  this  approach,  the 
constructed  TF  activities  are  reliable.  For  example,  for  a  TRN  that  has  the  properties  shown  in 
Fig.  3,  even  when  we  eliminate  50%  of  the  TRN  to  create  a  “preliminary  TRN”,  90%  of  the 
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constructed  TF  activities  have  a  correlation  coefficient  of  at  least  0.70  with  the  actual  TF 
activities  used  to  generate  the  synthetic  expression  data  (with  20  or  more  microarray 
experiments). 


Figure  3:  Properties  of  TRNs  used  in  the  synthetic  examples. 


Networks  that  consist  of  1000  genes  and  100  TFs  are  generated  using  the  probability  distribution  for  the  number  of 
genes  regulated  by  a  TF  shown  in  (a).  The  corresponding  probability  distribution  for  the  number  of  regulators  per 
gene  is  shown  in  (b).  The  average  number  of  regulators  per  gene  is  3.62,  5.22,  and  7.02  for  Networks  1,  2  and  3, 
respectively.  Equal  likelihood  is  chosen  for  up/down  regulation. 


a) 


Success  Rate 


b) 


Success  Rate 


Figure  4:  Effect  of  TRN  properties. 


We  used  Networks  1,  2  and  3  of  Fig.  3  to  generate  100  synthetic  expression  datasets,  and  eliminated  50%  of  the 
TF/gene  interactions  in  the  TRN.  Shown  is  the  percentage  of  the  deleted  network  recovered  as  a  function  of  success 
rate.  As  the  number  of  TF/gene  interactions  increases,  percentage  of  the  network  that  can  be  recovered  decreases,  b) 
Same  as  a)  except  we  used  Network  1  and  eliminated  25%,  50%,  and  75%  of  the  network.  As  one  would  intuitively 
expect,  higher  percentage  of  the  deleted  network  is  recoverable  when  a  more  complete  network  is  known. 
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Figure  5:  Reconstruction  of  TRNs. 


We  have  used  the  Network  1  of  Fig.  3  and  generated  synthetic  expression  data.  Then,  we  eliminated  50%  of  the 
network  (randomly),  and  used  FTF  to  reconstruct  the  deleted  network.  Fig.  a)  shows  the  percentage  of  the  deleted 
network  recovered  as  a  function  of  success  rate,  a  measure  of  the  likelihood  that  an  interaction  is  correct,  as 
estimated  from  the  training  set  (known  interactions).  As  the  number  of  microarray  experiments  increases,  a  higher 
percentage  of  the  network  can  be  reconstructed.  Flowever,  full  reconstruction  requires  too  many  experiments.  Fig.  b) 
shows  success  rate  as  a  function  of  the  absolute  value  of  the  linear  correlation  between  the  constructed  TF  activities 
and  gene  expression  data. 

The  essence  of  this  approach  is  to  estimate  TF  activities  from  a  preliminary  TRN  (training 
set)  and  expression  data.  Once  approximate  TF  activities  are  constructed,  we  calculate  their 
correlation  with  the  expression  profiles  of  the  genes  they  might  regulate,  and  rank  plausible 
TF/gene  interactions.  Results  from  synthetic  examples  using  a  network  of  1000  genes  and  100 
TFs  are  encouraging  (Figs.  4-5). 

Gene  Ontology 

We  use  the  biological  process  ontology  developed  by  the  Gene  Ontology  (GO)  consortium 
(i www.geneontology.org )  and  hypothesize  that  a  gene  pair  is  more  likely  to  be  regulated  in  the 
same  manner  as  the  similarity  between  their  GO  descriptions  increases.  As  a  gene  product  might 
be  assigned  multiple  GO  terms,  we  use  the  maximum  similarity  between  all  possible 
combinations.  Use  of  GO  similarity  has  already  been  shown  to  provide  information  about 
functional  modules  in  E.coli  (Wu  et  al.  2005).  We  have  extended  this  methodology  to  construct 
to  construct  TRNs.  Details  on  integration  of  GO  into  a  TRN  construction  strategy  are  given 
below. 

Phylogenic  Similarity  Analysis 

In  the  phylogenetic  profile  algorithm,  we  first  seek  orthologous  genes  (in  a  set  of  N  sequenced 
and  annotated  bacterial  genomes)  for  each  gene  in  the  bacterium  of  interest.  For  each  gene,  we 
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construct  a  vector  of  length  N  whose  z-th  element  is  assigned  0  (no  orthologous  gene  is  found  in 
bacterium  z)  or  n  (orthologous  gene  is  found  in  bacterium  z  and  its  order  in  the  genome  is  n).  The 
hypothesis  is  that  if  two  vectors  (for  a  gene  pair)  show  a  high  level  of  similarity,  this  gene  pair  is 
likely  to  be  similarly  regulated.  In  our  implementation  we  use  230  genomes  from 
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/.  As  operons  are  an  important  feature  of  bacterial 
genomes,  this  approach  is  very  likely  to  provide  functional  relationships  as  already  shown  for 
E.coli  (Wu  et  al.  2005).  The  vectors  noted  above  can  be  used  to  develop  various  measures  of 
similarity  that  yield  a  probability  for  the  accuracy  of  any  suggested  TF/gene  regulatory 
interaction  discovered.  TRN  enhancement  for  phylogenic  similarity  and  GO,  any  regulatory 
information  from  a  given  gene  is  taken  to  be  allocable  to  another  which  has  a  high  phylogenetic 
similarity  score. 

Multiple  TRN  Construction  Methodology 

Each  of  the  above  individual  method  provides  a  score  for  each  suggested  TF/gene  interaction. 
The  statistical  significance  of  the  score  is  assessed  by  the  ratio  of  the  probability  of  that  score  in 
the  training  set  to  that  in  the  random  set.  To  meet  the  objective  of  genome-wide  TRN  discovery 
we  seek  an  approach  that  integrates  sufficient  information  to  delineate  the  many  TF/gene 
interactions  and  to  eliminate  spurious  ones.  In  an  attempt  to  develop  an  objective  integration  of 
the  three  methods  (FTF,  phylogenic  similarity,  GO)  for  a  united  TRN  discovery  workflow,  we 
hypothesize  that  the  sum  of  the  logs  of  the  Bayesian-like  ratios  for  a  given  TF/gene  interaction 
provides  a  reliable  success  measure.  Application  of  the  approach  to  E.coli  is  shown  in  Fig.  6.  The 
results  are  posted  at  sysbio.indiana.edu.  The  microarray  dataset  was  gathered  from  NIH  omnibus 
service  (GSE7,  GSE8,  GSE9;  65  datasets). 
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Figure  6:  Probability  distribution  as  a  function  of  a)  GO,  b)  phylogenic  similarity,  c)  FTF  scores. 

In  comparison  with  Fig.  2  (correlation  based  on  gene-gene  correlations),  the  probability  distributions  for  the  training 
and  random  sets  can  be  easily  distinguished,  d)  the  probability  distributions  for  the  logarithm  of  the  multiplication  of 
Bayesian  ratios. 

We  have  applied  this  methodology  using  the  336  microarrays  on  B  cells  (GEO:  GSE  2350, 
submitted  by  K.  Basso)  using  a  preliminary  TRN  constructed  from  our  GeneDat  database,  FTF 
and  GO  similarity.  Over  15,000  predicted  TF/gene  interactions  were  discovered  and  are  posted  at 
http://systemsbiology.indiana.edu.  The  final  comparison  of  the  probability  distributions  for  the 
training  and  random  sets  are  shown  in  Fig.  7.  The  B  Cell  microarray  dataset  was  also  used  by 
Basso  et  al.  (2005)  to  make  predictions  for  the  MYC  TF.  Our  prediction  set  spans  489  TFs  for 
which  we  had  a  preliminary  TRN  in  the  GeneDat  database. 
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Figure  7:  Probability  distributions  for  the  final  score. 

Predictions  were  made  using  9589  genes  and  489  TFs  by  using  336  microarray  B  Cell  datasets  and  over  37  million 
gene-gene  GO  similarity  measure.  Over  15,000  TF/gene  pairs  were  found  to  be  statistically  significant.  Our 
predictions  are  available  at  http://systemsbiology.indiana.edu. 
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5.0  KAGAN:  Karyote  Gene  Analyzer 


Overview 

cDNA  microarray  and  other  multiplex  data  hold  promise  for  addressing  the  challenges  of  cellular 
complexity,  refined  diagnoses  and  the  discovery  of  well-targeted  treatments.  A  novel  approach  to 
the  construction  and  quantification  of  TRNs  was  developed  that  integrates  microarray  data  and 
cell  modeling  through  information  theory.  This  approach  was  implemented  as  the  KAGAN  Bio- 
SPICE  contribution.  KAGAN  complements  FTF  in  that  while  FTF  is  designed  to  construct  the 
structure  of  large  TRNs,  it  cannot  yield  information  about  the  physical  chemistry  of  the  network 
(i.e.  rate  and  binding  constants  of  genomic  processes);  conversely  KAGAN  is  more 
computationally  intensive  so  that  its  practical  use  is  for  detennining  the  physical  chemistry  for  a 
network  of  mostly  known  structure.  This  suggests  that  the  TRN  constructed  via  the  multiple 
method/FTF  workflow  of  the  previous  section  can  serve  as  input  for  KAGAN  wherein  this  TRN 
will  be  refined  and  quantified. 

Given  a  partial  transcriptional  regulatory  network  (TRN)  and  time  series  cDNA  microarray 
data,  a  probability  density  is  constructed  that  is  a  functional  of  the  time  course  of  TF 
thennodynamic  activities  at  the  site  of  gene  control,  and  is  a  function  of  mRNA  degradation  and 
transcription  rate  coefficients,  and  equilibrium  constants  for  TF/gene  binding.  A  kinetic  (and  not 
a  steady-state)  formulation  facilitates  the  analysis  of  phenomena  with  a  strongly  dynamical 
character  (e.g.  the  cell  cycle,  metabolic  oscillations,  viral  infection  or  response  to  changes  in  the 
extra-cellular  medium).  Our  KAGAN  approach  yields  more  physical-chemical  information  that 
compliments  the  results  of  network  structure  delineation  methods,  and  thereby  can  serve  as  an 
element  of  a  more  comprehensive  TRN  discovery/quantification  workflow.  The  most  probable 
TF  time  courses  and  values  of  the  aforementioned  parameters  are  obtained  by  maximizing  the 
probability.  As  the  time  course  of  the  activity  of  a  TF  is  computed  by  probability  functional 
maximization,  and  is  not  assumed  to  be  proportional  to  expression  level  of  the  mRNA  type  that 
encodes  the  TF,  observed  time  delays  between  mRNA  expression  and  TF  activity  are  accounted 
for.  This  allows  one  to  investigate  post-translational  and  TF  activation  mechanisms  of  gene 
regulation.  Accuracy  and  robustness  of  the  method  are  evaluated.  A  physically-motivated 
regularization  of  the  TF  time  course  is  found  to  overcome  difficulties  due  to  omnipresent  noise 
and  data  sparsity  that  plague  any  methods  of  microarray  data  analysis. 
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cDNA  microarray  (Schena  et  al.  1995;  DeRisi  et  al.  1997;  Sauter  et  al.  2003)  and  other 
multiplex  data  (e.g.  NMR  and  proteomics)  contain  a  wealth  of  information,  and  thereby  hold 
promise  for  addressing  the  challenge  of  cellular  complexity  and  deriving  advances  in  medical 
sciences  that  could  follow  from  it  (Brown  and  Botstein  1999;  Debouck  and  Goodfellow  1999; 
Gerhold  et  al.  1999;  Chitler  2004).  Considering  the  volume  of  the  data  and  the  complexity  of  the 
phenomena  to  be  understood,  it  is  evident  that  methods  for  the  interpretation  of  such  multiplex 
data  must  be  facilitated  by  automation.  Recently  we  proposed  an  approach  to  the  analysis  of 
multiplex  bioanalytical  data  based  on  its  integration  with  cell  modeling  through  information 
theory  (Sayyed- Ahmad  et  al.  2003).  Here  we  show  how  this  approach  can  be  extended  to  the 
analysis  of  microarray  time  series  data. 

The  objective  of  KAGAN  is  to  predict  TF  time  courses  and  obtain  estimates  of  biochemical 
rate  and  binding  constants  for  transcription  and  RNA  degradation.  KAGAN  accomplishes  this 
despite  omnipresent  noise  in  microarray  data  and  the  lack  of  a  complete  knowledge  of  the 
detailed  biochemistry  of  TF  formation/degradation/activation  processes.  Using  time  series  RNA 
expression  data,  this  module  yields  a  large  volume  of  information  on  the  genome  that  can  be 
used  to  discriminate  cell  lines,  i.e.  even  for  those  with  the  same  TRN  but  with  differences  in  the 
kinetic  parameters  due  to  small  gene  sequence  variations  could  have  dramatic  consequences  for 
cell  behavior  (e.g.  the  onset  and  progression  of  cancer  or  the  resistance  of  a  macrophage  to 
infection  of  a  B.  anthracis  spore. 


Transcription  Kinetics 

In  our  kinetic  methodology,  it  is  assumed  that  gene  i  (  /  =  1, 2,  •  •  •  Ng )  has  TF  binding  sites 
labeled  j  =  1,2,- There  are  N rj,  TFs  labeled  n  =  1,2,- •  NTF .  It  is  assumed  that  a  unique 
TF  (denoted  n. . )  will  have  appreciable  affinity  for  site  j  on  gene  i  (i.e.  competitive  binding  is 
ignored).  Assuming  the  binding  at  any  site  is  independent  of  others,  the  rate  coefficient  k  for 
RNA  polymerase  (RP)  complexing  with  gene  i  is  taken  to  be 


k.  =  kn 


n 


Q..T 


V  n 


H)/ 


m  1  +  Q  T 


(6) 
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where  b..  =  ±1  for  up/down  regulation  and  Q.  is  the  binding  constant  for  site  j  on  gene  i. 

V  lJ 

Assuming  that  RNA  polymerase  binding  on  the  gene  is  rate  limiting  for  transcription,  and 
adopting  first  order  degradation  for  RNA,  we  write 

dR 

—L  =  k.[RP]-X.R.,  (7) 

dt  1  1  1 

[RP]  being  the  activity  of  free  RNA  polymerase  (assumed  constant  and  is  thus  subsumed  in 
kmax  henceforth);  R:  is  the  number  of  RNA  molecules  per  cell  transcribed  from  gene  i,  and  /l,. 

includes  a  dependence  on  RNA  length  (Beehnan  and  Parker  1995). 

The  assumptions  noted  above  were  made  in  order  to  create  a  robust  Bio-SPICE  module.  As 
these  assumptions  are  relaxed  new  phenomenological  parameters  must  be  introduced,  putting 
more  demands  on  the  sparse,  noisy  microarray  data  analysis.  However,  we  are  continuing  to 
develop  KAGAN  so  that  in  ongoing  research  we  are  testing  versions  with  competitive  binding 
and  other  of  the  aforementioned  ignored  effects. 

If  the  initial  RNA  level  R,(0)  is  used  as  the  control  data  in  a  time  series  experiment,  one 
obtains 


dmsyn  k. 

— -  X.msyn 
dt  R.(0)  1  1 


(8) 


where  msyn(t )  =  Rj(t)/  RiO)  is  the  model-predicted  time-dependent  microarray  response.  This 


implies  that  in  addition  to  the  TF  activity  time  courses,  kmax  /  R  (0)  and  Xt  appear  as  independent 

parameters  that  can  be  determined  for  each  gene. 

The  power  of  our  information  theory  approach  is  that,  despite  the  incompleteness  of  the 
model,  we  can  correct  and  augment  the  TRN,  and  extract  the  set  of  parameters  and  TF  time 
courses  T  (/  )  from  microarray  time  series  data  (Tuncay  and  Ortoleva  2002;  Tandon  et  al.  2003; 

and  Sayyed-Ahmad  et  al.  2003,  in  related  problems).  Novel  features  of  this  approach  are 

•  the  independent  computation  of  the  Q,  kmax,  X  and  b,  yielding  much  more  infonnation 


about  TF  populations  and  TF/gene  interactions  than  other  approaches; 

•  the  use  of  a  physically-motivated  regularization  technique  that  filters  short  timescale 
noise  from  microarray  data; 
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•  the  intra-nuclear  TF  activities  for  a  eukaryotic  cell  over  time  T(l  )  and  independent 
monitoring  of  their  level  in  the  cytoplasm  yields  constraints  on  other  TF  process  timescales 
(e.g.  permeation  of  the  nuclear  membrane,  protein  complexing  in  the  cytoplasm)  and 
similarly  for  bacteria;  and 

•  the  ability  to  assess  the  uncertainty  in  all  predictions. 

In  our  information  theory  approach,  we  construct  probability  p  for  the  Q,A,k™  and 

b  (collectively  denoted  A)  and  the  time-dependence  of  TF  activities  (collectively  denoted 
T)  using  the  maximum  entropy  principle  (Shannon  and  Weaver  1949;  Jaynes  1957).  We 
introduce  a  measure  of  the  error  in  the  predicted  versus  observed  microarray  response.  Let 
M  '  be  the  microarray  expression  level  for  the  i- th  of  Ng  genes  in  the/-th  of  NcDNA 

experiments  (i.e.  time  slice  or  extra-cellular  condition).  The  microarray  error  EcDNA  is 
defined  via 


1  ycDNA  g 

Ecdna=  £ 


(9) 


=i  i=i 


where  m-  =  M-  / M?  with  /=  A  being  the  initial  time  or  standard  condition;  h(x,y)  for  any  x,y 


2 

provides  an  error  metric  (e.g.  h(x,y )  =  (x-y)  ). 

Constructing  entropy  with  the  p  -weighted  average  of  EcDNA  and  information  on  the  time 
scale  on  which  T(t)  can  evolve  (our  regularization  condition),  we  obtain  p  and  then  maximize  it 
with  respect  to  A  and  T_(t)  to  detennine  the  most  probable  A  values  and  T  timecourses. 


Application  to  E.  coli 

E.coli  microarray  data  obtained  for  the  transition  from  glucose  to  acetate  media  (Kao  et  al.  2004) 
was  used  to  demonstrate  this  approach.  The  data  included  expression  levels  (relative  to  the  initial 
state)  of  100  genes  at  300,  900,  1800,  3600,  7200,  10800,  14400,  18000  and  21600  seconds.  The 
preliminary  TRN  used  was  based  on  RegulonDB  (Salgado  et  al.  2001)  as  modified  by  Kao  et  al. 
(2004).  Fig.  8  shows  the  time  courses  of  16  TFs  (out  of  38).  Kao  et  al.  (2004)  applied  their  NCA 
code  (Liao  et  al.  2003)  to  the  same  problem;  however,  the  TRN  used  (that  consists  of  100  genes 
and  38  TFs)  does  not  satisfy  the  NCA  column  rank  requirement.  Furthermore,  the  transcription 
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kinetics  in  our  approach  differs  from  that  of  NCA.  Despite  these  differences,  it  is  surprising  that 
15  out  of  16  TF  activity  time  courses  (Kao  et  al.  2004  only  presented  16)  are  in  qualitative 
agreement  with  our  results. 


5000  10000  15000  20000 

Time  (sec) 


5000  10000  15000  20000 

Time  (sec) 


Figure  8:  Predicted  TF  activity  time  courses  for  16  of  38  TFs  constructed  using  our  module  of  C3  and  a 
preliminary  TRN  (from  www.ecocyc.com  and  gene  expression  data). 


Results  are  in  qualitative  agreement  with  those  obtained  by  Kao  et  al.  (2004)  except  for  PhoB.  pstC  and  ptS  are 
upregulated  by  PhoB  and  their  level  of  expression  increases  in  time,  therefore  one  would  expect  the  activity  of  PhoB 
to  increase  as  well. 
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6.0  GeneDat:  Human  and  Bacterial  Transcriptional  Regulatory  Database 


GeneDat  is  a  database  of  TRN  information  created  to  be  a  part  of  an  automated  TRN  discovery 
system  that  includes  FTF  and  KAGAN  ( sysbio.indiana.edu ).  Users  can  enter  their  cDNA 
microarray  data,  obtain  an  associated  training  set  from  GeneDat,  and  construct  a  TRN  consistent 
with  the  microarray  data. 

A  preliminary  TRN  is  used  in  our  approach  to  start  the  FTF,  KAGAN,  and  bioinfonnatics 
modules.  For  a  list  of  genes  in  a  subnetwork  (e.g.  as  identified  using  microarray  data)  regulatory 
TFs  and  a  list  of  the  genes  that  encode  them  (or  components  thereof)  are  extracted  from  a 
database.  Existing  databases  do  not  provide  all  up/down  regulatory  interactions  explicitly  (i.e. 
the  user  is  often  referred  to  the  citations)  and  entries  for  specific  pathways  of  interest  are  often 
missing.  We  have  established  a  database  of  mammalian  (mostly  human)  TF/gene  regulatory 
up/down  interactions.  This  GeneDat  database  has  over  13,000  TF/gene  experimentally-verified 
regulatory  relationships.  The  TFs  in  the  database  are  the  single  or  multi-component  active  forms 
(as  far  as  is  known  or  given  in  the  references).  GeneDat  also  records  the  genes  which  are 
translated  into  the  TF  component  proteins.  GeneDat  is  annotated  with  gene  and  TF  organisms, 
cell  lines  and  literature  citations.  Extensive  alias  tables  for  genes  and  TFs  remove  redundancy. 
Data  has  been  gathered  and  curated  from  a  variety  of  databases  and  the  literature.  The  former 
include  TRANSFAC  ( www.gene-regidation.com ),  National  Center  for  Biotechnology 
Information  ( www.ncbi.nlm.nih.gov ),  Protein  Lounge  ( www.proteinlounge.com ),  and 
Transcriptional  Regulatory  Regions  Database  ( www.mgs.bionet.nsc.ru/mgs/gnw/trrd ).  This  data 
is  curated  and  reformatted  in  order  to  allow  GeneDat  to  be  efficiently  folded  into  an  automated 
TRN  discovery  system.  TRN  results  for  E.coli  and  B  Cell,  based  on  the  GeneDat  preliminary 
TRN,  were  obtained  using  the  TRN  discovery  methodology  discussed  earlier. 

We  have  also  gathered  data  on  several  bacteria  to  serve  as  a  training  set,  and  facilitate  the 
construction  of  TRN.  TRNs  for  B.  subtilis  (by  DBTBS,  http://dbtbs.hgc.jp )  and  for  E.  coli 
(i www.ecocyc.org )  with  augmentation  using  information  from  Regulon  DB 
( http://www.cifi.wiam.mx/ComputatioJial_GeJiomics/reguIondb/ ). 
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7.0  CellX:  Multi-Dimensional  Cell  Model 

Overview 


The  CellX  cell  simulator  accounts  for  reaction  and  transport  processes  and  the  attendant 
intracellular  gradients  of  composition.  CellX  is  multi-dimensional  -  i.e.  reaction-transport 
equations  are  solved  on  membranes  (2-D)  and  within  bulk  media  (3-D),  all  simultaneously  and 
with  full  coupling  that  accounts  for  molecular  exchange  between  these  domains.  Equations  are 
solved  on  finite  element  grids  to  yield  the  timecourse  of  cell  state. 

In  the  implementation  provided  to  Bio-SPICE  we  include  reactions  and  transport  on  the  inner 
surface  of  the  outer  membrane  of  a  bacterium;  in  particular  the  model  accounts  for  surface 
processes  associated  with  the  dynamics  of  Min  proteins  and  their  transport  within,  and  exchange 
with,  the  cell’s  interior  continuum. 

Multi-Dimensional  Model  Formulation 


The  objective  of  the  modeling  underlying  CellX  as  described  below  is  to  address  the  hierarchical 
complexity  of  intra-cellular  structure  and  dynamics.  Intra-cellular  structural  detail  accounted  for 
simultaneously  is  the  bulk  medium  (3-D)  and  the  2-D  levels  (e.g.  along  membrane  surfaces  or 
interiors).  Thus  we  term  our  approach  multi-dimensional.  Through  this  approach,  we  simulate 
directed  transport,  well-localized  functions  and  other  key  phenomena.  As  for  other  complex 
systems,  the  art  of  cell  modeling  is  in  the  choice  of  the  level  of  the  description. 

In  the  most  general  formulation,  a  CellX  model  is  divided  into  compartments  labeled 

a  =  1, 2,--Nc  separated  by  membranes.  For  Bio-SPICE  we  implemented  a  single  compartment 

version. 

The  reaction-transport  differential  equations  on  which  the  Bio-SPICE  version  of  CellX  is 
based  are,  schematically, 


U'-'3D  _  r>  rv2  pi  r> 

^  ~  LJ3Dy  3D^3D  f'SD 

Ot 


2D  _  p,  vt2  pi  r>  i  D 

q  ^2 D  V  2D^2D  2D  "l~  ^DIID 


n  *  ^3 bCjo  ^2D/3D 


(10) 

(11) 

(12) 


where 
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R2d  ,  R3D  =  net  reaction  rates  for  surface  and  bulk  reactions 

R2d/3d  =  ncl  rate  of  molecular  exchange  between  the  interior  bulk  and  the  membrane  surface 
D2D ,  D3d  =  diffusion  coefficient  for  the  surface  and  bulk 
C2D=  concentration  at  the  membrane  surface  (molecules/area) 

C3D  =  concentration  in  the  cell  interior  bulk  (molecules/volume). 

The  net  rates  are  related  through  general  stoichiometric  matrices  to  mass  action  rate  laws  for 
each  fundamental  process.  Finally,  h  is  the  unit  normal  to  the  membrane  pointing  into  the  cell 
interior. 

Application  to  E.  coli  Division 

CellX  was  used  to  simulate  the  self-organized  location  of  the  division  plane  in  E.  coli.  This 
plane  is  believed  to  form  where  the  time-averaged  surface-adsorbed  concentration  of  MinD 
protein  is  a  minimum.  Fig.  9  shows  an  example  wherein  multiple  division  planes  are  predicted 
for  abnormally  long  cells,  as  observed.  CellX  was  also  applied  to  spherical  E.  coli  cells  and 
bursting  patterns  of  Min  protein  localization  were  predicted. 


MinD  Cone  Profile  (normal  length) 


membrane  position  from  south  to  north  (pm) 


Figure  9a:  Surface  pole-to-pole 
MinD  concentration  profile  for 
a  normal  length,  rod  shaped  E.coli 
cell,  suggesting  one  division  plane  at 
the  middle.  This  time-average  profile 
was  obtained  by  sampling  all  nodes 
within  a  ring  of  0. 1pm  width  over  the 
long  axis. 


Figure  9a:  Surface  pole-to-pole  MinD  concentration  profile  for  a  normal  length,  rod  shaped  E.coli  cell, 

suggesting  one  division  plane  at  the  middle. 
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MinD  Cone  Profile  (1.5  length  of  normal) 


-1.5  -1.0  -0.5  0.0  0.5  1.0  1.5 

membrane  position  from  south  to  north  (pm) 


Figure  9b  Same  as  (a)  except  for  a  1.5  x  normal  length  cell 

(i.e.  3.0  pm  in  length  and  0.5  pm  in  diameter)  indicating  two  division  planes.  The  sampling  interval  was  0. 1  pm. 


Figure  9c  Same  as  (a)  except  for  a  2  x  normal  length  cell 

(i.e.  4.0  pm  in  length  and  0.5  pm  in  diameter)  suggesting  3  division  planes.  The  sampling  interval  was  0.2  pm. 
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8.0  Conclusions 


A  number  of  systems  biology  modules  were  created  for  Bio-SPICE.  These  include  two  cell 
modeling  systems  (Karyote  and  CellX)  and  two  data/model  integrators  (FTF  and  KAGAN).  In 
support  of  these  modules  we  created  a  website  ( sysbio.indiana.edu )  that  provides  model  building 
resources. 

Comparison  of  results  from  our  Karyote  and  CellX  results  with  observation  shows  that  they 
are  viable  instruments  for  predicting  cell  behavior.  Similar  comments  hold  for  the  ability  of  FTF 
and  KAGAN  to  reliably  construct  and  refine  transcriptional  regulatory  networks. 

The  Bio-SPICE  project  was  very  successful  in  demonstrating  the  feasibility  of  creating  a 
platform  to  use  interoperable  systems  biology  modules  in  an  automated  workflow.  However,  in 
developing  and  installing  these  modules  we  concluded  that  the  Dashboard  and  other  Bio-SPICE 
infrastructure  were  somewhat  difficult  to  use  although  the  overall  concept  holds  great  promise. 

Cell  modeling  and  computational  biology  in  general  are  still  at  an  early  stage  of 
development.  Thus  a  larger  investment  in  developing  the  physico-chemical  models,  as  opposed 
to  general  structural  concerns,  might  have  led  to  greater  progress.  We  concluded  that  the 
requirements  of  the  science  should  guide  the  infrastructure  effort  more  directly. 

We  conclude  that  a  follow-on  project  should  be  focused  on  the  development  of  the  more 
detailed  physico-chemically  based  cell  models. 
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9.0  Recommendations 


To  realize  the  full  potential  of  Bio-SPICE  it  is  necessary  to  find  a  follow-on  project  that  is  led  by 
researchers  developing  the  next  generation  of  systems  biology  modules.  The  team  should  also 
include  experimentalists,  mathematicians,  computational  and  physical  scientists. 

Important  issues  are 

•  multi-scale  modeling 

•  extremely  large  numbers  of  variables 

•  calibration  of  complex  models 

•  mechanics  of  cell  shape  and  locomotion 

•  multiplex  data/model  integration 

•  model  builder  modules  for  3-D  and  chemically  complex  models 

•  methods  to  examine  complex  model  output 

•  integration  software  for  models  and  treatment  discovery 

These  issues  could  constitute  a  new  DoD  initiative  as  most  other  programs  tend  to  assume  the 
models  can  be  developed  elsewhere. 

Viral  threats  were  not  addressed  in  Bio-SPICE.  We  suggest  that  a  new  initiative  be  launched 
that  is  on  the  scale  of  Bio-SPICE.  The  objective  is  to  implement  a  workflow  that  takes  a  viral 
gene  sequence  and  yields  drug  targets,  vaccines,  and  side  effect-free  treatment  strategies.  The 
types  of  modules  developed  should  allow  for  a  spectrum  of  topics  that  include 

•  protein-protein  interaction 

•  all-atom  whole  virus  prediction 

•  viral/host  cell  membrane  interactions 

•  viral/host  transcriptional  regulatory  network  interactions 

•  mutation  dynamics 

•  viral  life  cycle 

•  virus-like  nanoparticles  for  drug  delivery 

•  drug  target  discovery  and  vaccine  design 

It  is  suggested  that  this  initiative  be  planned  by  a  panel  of  virologists,  clinicians,  modelers,  and 
b  io  informa  ticists . 
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